In this document, we’ll use examples from different submissions in assignment 2 to illustrate principles for ‘tidying’ plots.

Please don’t be discouraged if you see one of your own plots in this document. You all did a great job of creating plots, and these are suggestions for taking your creations and making them better!

Reducing complexity

For readability, it often makes sense to break up information-dense graphics into smaller “chunks”, especially if there are no constraints on the number of plots you can show your reader. You could think of facetting as a built-in way the grammar facilitates this process, but there’s also nothing stopping you from creating an ‘ensemble’ of plots.

Let’s look at an example. A complex plot of the interaction between Q30, Q31, and Q17 could be achieved with facetting as follows:

This plot has packaged a lot of information and puts the focus on making comparisons between answers to Q17 (Do you have prior coding experience?).
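A facetted plot of this kind might be built with ggplot2 roughly as follows. This is a minimal sketch: the data frame `survey` is made up, and the answer levels are placeholders, since only the column names Q17, Q30 and Q31 come from the assignment.

```r
library(ggplot2)

# Hypothetical survey responses: the answer levels are placeholders,
# not the real categories from the class survey.
set.seed(1)
survey <- data.frame(
  Q17 = sample(c("Yes", "No"), 60, replace = TRUE),
  Q30 = sample(c("A", "B", "C"), 60, replace = TRUE),
  Q31 = sample(c("X", "Y"), 60, replace = TRUE)
)

# One panel per answer to Q17; within each panel, bars count Q30
# and the fill colour splits each bar by Q31.
p_facet <- ggplot(survey, aes(x = Q30, fill = Q31)) +
  geom_bar(position = "dodge") +
  facet_wrap(~Q17)

p_facet
```

Because the panels are defined by Q17, the eye is drawn to differences between the panels, which is why this design emphasises comparisons across Q17.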

We could instead make plots looking at combinations of the three variables above.

With this approach we learn the same things as from the facetted plot, and we can make additional comparisons because we are no longer forced to compare between the categories in Q17.
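One way to produce such an ensemble is to draw a simple plot for each pair of variables. The sketch below uses made-up data and a hypothetical helper, `pair_plot()`, which is not part of any package. The answer levels are placeholders.

```r
library(ggplot2)

# Hypothetical survey responses with placeholder answer levels.
set.seed(1)
survey <- data.frame(
  Q17 = sample(c("Yes", "No"), 60, replace = TRUE),
  Q30 = sample(c("A", "B", "C"), 60, replace = TRUE),
  Q31 = sample(c("X", "Y"), 60, replace = TRUE)
)

# Hypothetical helper: a dodged bar chart of one variable,
# coloured by a second variable. Column names are passed as strings.
pair_plot <- function(data, x, fill) {
  ggplot(data, aes(x = .data[[x]], fill = .data[[fill]])) +
    geom_bar(position = "dodge") +
    labs(x = x, fill = fill)
}

# One small plot per pair of variables instead of one dense plot.
plots <- list(
  pair_plot(survey, "Q30", "Q31"),
  pair_plot(survey, "Q30", "Q17"),
  pair_plot(survey, "Q31", "Q17")
)

# Show the plots one after another.
for (p in plots) print(p)
```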

Remember one purpose of plots is to communicate what you’ve found in the data to the reader - a more complex plot forces a reader to take longer to understand your findings and has a narrow viewpoint. Breaking a complex plot into chunks allows your reader to slowly gain a richer understanding of the data.

Ordering

Which is easier to read? This:

or this:

In the first plot, we have to spend more time searching for the most frequent category, and it isn’t immediately obvious what the second or third most popular superpower is. When the y-axis is reordered by count, this information is perceived immediately, and the less interesting information is pushed to the bottom of the axis.
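A minimal sketch of reordering by count, using base R’s `reorder()` and made-up superpower answers. It assumes ggplot2 3.3.0 or later, where `geom_col()` accepts a categorical variable on the y-axis.

```r
library(ggplot2)

# Hypothetical answers to a "favourite superpower" question.
powers <- data.frame(
  superpower = c(rep("Flight", 12), rep("Invisibility", 7),
                 rep("Teleportation", 15), rep("Time travel", 4))
)

# Count each category, then order the factor levels by that count.
counts <- as.data.frame(table(superpower = powers$superpower))
counts$superpower <- reorder(counts$superpower, counts$Freq)

# Levels are in ascending order of count, and ggplot2 draws the first
# level at the bottom of a discrete y-axis, so the most frequent
# category ends up at the top.
p_ordered <- ggplot(counts, aes(x = Freq, y = superpower)) +
  geom_col()

p_ordered
```

Without the `reorder()` step the categories would appear in alphabetical order, which is rarely the order the reader cares about.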

Comparing counts

In lectures we discussed both bar charts and ‘100%’ charts (which are stacked bar charts but with proportions instead of counts). Consider examining the relationship between degree type and hours studied.

Here’s one group’s plot to explore that relationship:

Because we are viewing the explanatory variable along the x-axis, the viewer knows the denominator (the total number of people within a category: [0,3) up to more than 12), and we can easily compare within a category along the x-axis by seeing the differences in the heights of the bars.

However, to compare between categories, the viewer has to make relative judgements between the same-coloured rectangles across the answers to “Type of degree?”, and this is no longer appropriate since the denominators are different. It’s difficult to see that relatively more people in single degrees study 3-6 hours compared to people in single degrees that study 6-9 hours. We need to normalise the counts within each category of Q30 to be able to compare the answers to Q12 (resulting in a 100% chart):
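In ggplot2, the normalisation can be done by the bar position adjustment `position = "fill"`. The sketch below uses made-up counts and descriptive column names (`degree`, `hours`) rather than the survey’s question numbers; which variable goes on the x-axis is an assumption for illustration.

```r
library(ggplot2)

# Hypothetical responses: degree type and weekly study hours.
hours_levels <- c("[0,3)", "[3,6)", "[6,9)", "[9,12)", "12+")
study <- data.frame(
  degree = rep(c("Single", "Double"), times = c(40, 20)),
  hours = factor(
    c(rep(hours_levels, times = c(4, 10, 8, 10, 8)),  # single degree
      rep(hours_levels, times = c(2, 3, 5, 5, 5))),   # double degree
    levels = hours_levels
  )
)

# position = "fill" rescales every bar to height 1, so the segments
# show proportions within each degree type rather than raw counts.
p_fill <- ggplot(study, aes(x = degree, fill = hours)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")

p_fill

# The same normalisation as a table: each row sums to 1.
props <- prop.table(table(study$degree, study$hours), margin = 1)
round(props, 2)
```

With every bar the same height, same-coloured segments can be compared across bars even though the group sizes differ.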

An alternative to the 100% chart that generalizes to more than two categorical variables is the mosaic plot, which invites the reader to make comparisons between areas.

Here’s a redesign of the 100% chart as a mosaic chart using the R package ggmosaic:

The width of each rectangle along the x-axis represents the total number of responses to each category of question 8, and the heights represent the proportion of responses to Q12. From this plot we can immediately see that the majority of students study more than 6 hours per week regardless of degree type.

Here’s another example. We would like to examine the relationship between year in school and hours spent studying each week.
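To make the width and height encoding concrete, here is a small sketch with made-up counts. It deliberately swaps in base R’s `mosaicplot()` instead of ggmosaic so that it runs without extra packages; the column names `degree` and `hours` are illustrative, not the survey’s question numbers.

```r
# Hypothetical responses: degree type and weekly study hours.
hours_levels <- c("[0,3)", "[3,6)", "[6,9)", "[9,12)", "12+")
study <- data.frame(
  degree = rep(c("Single", "Double"), times = c(40, 20)),
  hours = factor(
    c(rep(hours_levels, times = c(4, 10, 8, 10, 8)),  # single degree
      rep(hours_levels, times = c(2, 3, 5, 5, 5))),   # double degree
    levels = hours_levels
  )
)

# Two-way table: the first dimension (degree) sets the column widths,
# the second (hours) splits each column vertically.
tab <- table(degree = study$degree, hours = study$hours)
mosaicplot(tab, color = TRUE, main = "Hours studied by degree type")

# Column widths are proportional to the share of respondents
# in each degree type.
widths <- prop.table(margin.table(tab, 1))

# Tile heights within a column are the proportions of each hours
# category within that degree type.
heights <- prop.table(tab, margin = 1)
```

Printing `widths` and `heights` gives the numbers that the plot encodes as areas, which can help when checking a reading of the plot.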

From the plot below a group stated:

“We can see overall 3rd year students put a lot more hours into study per week. This could perhaps be due to increased workload during the 3rd year as opposed to 1st year.”

To answer the original question you need to look at the distribution of hours studying, within each year.

Facet by year in school, and then make a bar chart in each facet. You can see that most students are in year 2 or 3, and the counts increase over hours spent studying. Both years have this pattern.

Because the numbers are so small in all other groups, we drop them and focus only on years 2 and 3. Then we make a mosaic plot to directly compare the proportions in each hour category, by year. We can see that there is not much difference in the time spent studying each week by year. There is a very small increase for year 3 in the more than 12 hours category, and decreases in the 6-9 and 9-12 hours categories. This hints that third year students may be studying slightly more.
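The two steps, facetting by year and then restricting to the well-populated years, could be sketched as follows. The counts are invented for illustration and do not reproduce the class data.

```r
library(ggplot2)

# Hypothetical responses: year in school and weekly study hours.
hours_levels <- c("[0,3)", "[3,6)", "[6,9)", "[9,12)", "12+")
years <- data.frame(
  year = c(rep("1", 3), rep("2", 30), rep("3", 40), rep("4", 2)),
  hours = factor(c(
    rep(hours_levels, times = c(1, 1, 1, 0, 0)),   # year 1
    rep(hours_levels, times = c(2, 4, 7, 8, 9)),   # year 2
    rep(hours_levels, times = c(3, 6, 8, 9, 14)),  # year 3
    rep(hours_levels, times = c(0, 0, 1, 0, 1))    # year 4
  ), levels = hours_levels)
)

# Step 1: one bar chart of hours per year, to see the distribution
# within each year and how many students each year contains.
p_years <- ggplot(years, aes(x = hours)) +
  geom_bar() +
  facet_wrap(~year)
p_years

# Step 2: keep only the two well-populated years.
years23 <- subset(years, year %in% c("2", "3"))

# Proportion in each hours category within each year: rows sum to 1,
# so the two years can be compared despite their different sizes.
prop_by_year <- prop.table(
  table(year = years23$year, hours = years23$hours),
  margin = 1
)
round(prop_by_year, 2)
```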

Sample size and aggregation

When a plot performs a statistical transformation of a variable, be aware of the sample size used to calculate the transformation. In a boxplot, five numbers are computed to summarise a distribution. If the sample size is small, the boxplot can be misleading:

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

Interpretation: People spending few study hours are spending too much time on the internet.

Correction: There are only three students in the category of spending too much time on the web and too little time studying.
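One way to guard against this is to count the observations behind each box and show that count on the plot. The sketch below uses made-up data; the column names `study_hours` and `web_hours` are hypothetical.

```r
library(ggplot2)

# Hypothetical data: hours on the internet by study-hours category.
# The lowest study category contains only three students.
web <- data.frame(
  study_hours = c(rep("[0,3)", 3), rep("[3,6)", 12), rep("[6,9)", 15)),
  web_hours = c(9, 10, 12, seq(2, 7.5, by = 0.5), seq(1, 8, by = 0.5))
)

# How many observations sit behind each box?
group_n <- table(web$study_hours)
group_n

# Put the group size into the axis label so the reader can see it.
web$label <- paste0(
  web$study_hours, "\n(n = ", as.vector(group_n[web$study_hours]), ")"
)

# Overlaying the raw points makes a three-point "distribution" obvious.
p_box <- ggplot(web, aes(x = label, y = web_hours)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, height = 0)

p_box
```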

Conditioning

We should condition the response variable (the variable we are trying to understand) on the other variables we think will explain the response. This allows us to compare the distribution of the response variable across the levels of our explanatory variable(s).

Final example: how does core or elective vary by year in school?
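In table form, conditioning means choosing which margin the proportions are computed within. A minimal sketch with made-up data, treating prior coding experience as the response and degree type as the explanatory variable (both choices are assumptions for illustration):

```r
# Hypothetical responses: degree type (explanatory) and
# prior coding experience (response).
coding <- data.frame(
  degree = rep(c("Single", "Double"), times = c(40, 20)),
  experience = c(rep(c("Yes", "No"), times = c(10, 30)),  # single degree
                 rep(c("Yes", "No"), times = c(12, 8)))   # double degree
)

tab <- table(degree = coding$degree, experience = coding$experience)

# Condition on the explanatory variable (rows): the proportion with
# and without experience within each degree type. Rows sum to 1.
given_degree <- prop.table(tab, margin = 1)
given_degree

# Conditioning the other way answers a different question: the split
# of degree types within each experience group. Columns sum to 1.
given_experience <- prop.table(tab, margin = 2)
given_experience
```

The two tables come from the same counts but support different comparisons, so it is worth deciding which variable is the response before choosing a plot.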

Which plot is appropriate here?

This:

or this:

or this: